2. Time Series Graphics

2.1 tsibble objects

How to create a tsibble:

library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ tibble      3.1.5     ✓ tsibble     1.0.1
## ✓ dplyr       1.0.7     ✓ tsibbledata 0.3.0
## ✓ tidyr       1.1.4     ✓ feasts      0.2.2
## ✓ lubridate   1.8.0     ✓ fable       0.3.1
## ✓ ggplot2     3.3.5
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
y <- tsibble(
  Year = 2015:2019,
  Observation  = c(123, 39, 78, 52, 110),
  index = Year
)

y
## # A tsibble: 5 x 2 [1Y]
##    Year Observation
##   <int>       <dbl>
## 1  2015         123
## 2  2016          39
## 3  2017          78
## 4  2018          52
## 5  2019         110

For observations that are more frequent than once per year, we need to use a time class function on the index. For example:

olympic_running
## # A tsibble: 312 x 4 [4Y]
## # Key:       Length, Sex [14]
##     Year Length Sex    Time
##    <int>  <int> <chr> <dbl>
##  1  1896    100 men    12  
##  2  1900    100 men    11  
##  3  1904    100 men    11  
##  4  1908    100 men    10.8
##  5  1912    100 men    10.8
##  6  1916    100 men    NA  
##  7  1920    100 men    10.8
##  8  1924    100 men    10.6
##  9  1928    100 men    10.8
## 10  1932    100 men    10.3
## # … with 302 more rows

The 14 time series in this object are uniquely identified by the keys: the Length and Sex variables. The distinct() function can be used to show the categories of each variable or even combinations of variables:

olympic_running %>% distinct(Sex)
## # A tibble: 2 × 1
##   Sex  
##   <chr>
## 1 men  
## 2 women

Working with tsibble objects

We can use dplyr functions such as mutate(), filter(), select() and summarise() to work with tsibble objects.

PBS
## # A tsibble: 67,596 x 9 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [336]
##       Month Concession  Type   ATC1  ATC1_desc   ATC2  ATC2_desc   Scripts  Cost
##       <mth> <chr>       <chr>  <chr> <chr>       <chr> <chr>         <dbl> <dbl>
##  1 1991 Jul Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   18228 67877
##  2 1991 Aug Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   15327 57011
##  3 1991 Sep Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   14775 55020
##  4 1991 Oct Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   15380 57222
##  5 1991 Nov Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   14371 52120
##  6 1991 Dec Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   15028 54299
##  7 1992 Jan Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   11040 39753
##  8 1992 Feb Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   15165 54405
##  9 1992 Mar Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   16898 61108
## 10 1992 Apr Concession… Co-pa… A     Alimentary… A01   STOMATOLOG…   18141 65356
## # … with 67,586 more rows

Filter the dataset to extract the A10 scripts:

a10 <- PBS %>% filter(ATC2 == "A10")
a10
## # A tsibble: 816 x 9 [1M]
## # Key:       Concession, Type, ATC1, ATC2 [4]
##       Month Concession  Type   ATC1  ATC1_desc   ATC2  ATC2_desc  Scripts   Cost
##       <mth> <chr>       <chr>  <chr> <chr>       <chr> <chr>        <dbl>  <dbl>
##  1 1991 Jul Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   89733 2.09e6
##  2 1991 Aug Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   77101 1.80e6
##  3 1991 Sep Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   76255 1.78e6
##  4 1991 Oct Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   78681 1.85e6
##  5 1991 Nov Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   70554 1.69e6
##  6 1991 Dec Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   75814 1.84e6
##  7 1992 Jan Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   64186 1.56e6
##  8 1992 Feb Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   75899 1.73e6
##  9 1992 Mar Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   89445 2.05e6
## 10 1992 Apr Concession… Co-pa… A     Alimentary… A10   ANTIDIABE…   97315 2.23e6
## # … with 806 more rows

Select the columns we need in our analysis:

a10 <- a10 %>% select(Month, Concession, Type, Cost)
a10
## # A tsibble: 816 x 4 [1M]
## # Key:       Concession, Type [4]
##       Month Concession   Type           Cost
##       <mth> <chr>        <chr>         <dbl>
##  1 1991 Jul Concessional Co-payments 2092878
##  2 1991 Aug Concessional Co-payments 1795733
##  3 1991 Sep Concessional Co-payments 1777231
##  4 1991 Oct Concessional Co-payments 1848507
##  5 1991 Nov Concessional Co-payments 1686458
##  6 1991 Dec Concessional Co-payments 1843079
##  7 1992 Jan Concessional Co-payments 1564702
##  8 1992 Feb Concessional Co-payments 1732508
##  9 1992 Mar Concessional Co-payments 2046102
## 10 1992 Apr Concessional Co-payments 2225977
## # … with 806 more rows

Summarise allows us to combine data across keys:

a10 <- a10 %>% summarise(TotalCost = sum(Cost))
a10
## # A tsibble: 204 x 2 [1M]
##       Month TotalCost
##       <mth>     <dbl>
##  1 1991 Jul   3526591
##  2 1991 Aug   3180891
##  3 1991 Sep   3252221
##  4 1991 Oct   3611003
##  5 1991 Nov   3565869
##  6 1991 Dec   4306371
##  7 1992 Jan   5088335
##  8 1992 Feb   2814520
##  9 1992 Mar   2985811
## 10 1992 Apr   3204780
## # … with 194 more rows

Create new variables using the mutate function:

a10 <- a10 %>% mutate(TotalCost = TotalCost / 1e6)
a10
## # A tsibble: 204 x 2 [1M]
##       Month TotalCost
##       <mth>     <dbl>
##  1 1991 Jul      3.53
##  2 1991 Aug      3.18
##  3 1991 Sep      3.25
##  4 1991 Oct      3.61
##  5 1991 Nov      3.57
##  6 1991 Dec      4.31
##  7 1992 Jan      5.09
##  8 1992 Feb      2.81
##  9 1992 Mar      2.99
## 10 1992 Apr      3.20
## # … with 194 more rows

Read a CSV file and convert to a table:

prison <- readr::read_csv('https://OTexts.com/fpp3/extrafiles/prison_population.csv')
## Rows: 3072 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): State, Gender, Legal, Indigenous
## dbl  (1): Count
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prison <- prison %>% 
  mutate(Quarter = yearquarter(Date)) %>% 
  select(-Date) %>% 
  as_tsibble(key = c(State, Gender, Legal, Indigenous),
             index = Quarter)
prison
## # A tsibble: 3,072 x 6 [1Q]
## # Key:       State, Gender, Legal, Indigenous [64]
##    State Gender Legal    Indigenous Count Quarter
##    <chr> <chr>  <chr>    <chr>      <dbl>   <qtr>
##  1 ACT   Female Remanded ATSI           0 2005 Q1
##  2 ACT   Female Remanded ATSI           1 2005 Q2
##  3 ACT   Female Remanded ATSI           0 2005 Q3
##  4 ACT   Female Remanded ATSI           0 2005 Q4
##  5 ACT   Female Remanded ATSI           1 2006 Q1
##  6 ACT   Female Remanded ATSI           1 2006 Q2
##  7 ACT   Female Remanded ATSI           1 2006 Q3
##  8 ACT   Female Remanded ATSI           0 2006 Q4
##  9 ACT   Female Remanded ATSI           0 2007 Q1
## 10 ACT   Female Remanded ATSI           1 2007 Q2
## # … with 3,062 more rows

2.2 Time plots

For time series data, the obvious graph to start with is a time plot. That is, the observations are plotted against the time of observation, with consecutive observations joined by straight lines. Figure 2.1 shows the weekly economy passenger load on Ansett airlines between Australia’s two largest cities.

melsyd_economy <- ansett %>% 
  filter(Airports == "MEL-SYD", Class == "Economy") %>% 
  mutate(Passengers = Passengers / 1000)
autoplot(melsyd_economy, Passengers) +
  labs(title = "Ansett airline economy class",
       subtitle = "Melbourne-Sydney",
       y = "Passengers ('000)")

The time plot immediately reveals some interesting features.

• There was a period in 1989 when no passengers were carried — this was due to an industrial dispute.
• There was a period of reduced load in 1992. This was due to a trial in which some economy class seats were replaced by business class seats.
• A large increase in passenger load occurred in the second half of 1991.
• There are some large dips in load around the start of each year. These are due to holiday effects. • There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.
• There are some periods of missing observations.

Let’s take a look at the a10 data saved earlier, plotted as a time series:

autoplot(a10, TotalCost) +
  labs(y = "$ millionss",
       title = "Australian antidiabetic drug sales")

2.3 Time Series Patterns

Trend
A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as “changing direction,” when it might go from an increasing trend to a decreasing trend. There is a trend in the antidiabetic drug sales data shown in Figure 2.2.
Seasonal
A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known period. The monthly sales of antidiabetic drugs (Figure 2.2) shows seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.
Cyclic
A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle.” The duration of these fluctuations is usually at least 2 years.

2.4 Seasonal plots

A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. An example is given in Figure 2.4 showing the antidiabetic drug sales.

a10 %>%
  gg_season(TotalCost, labels = "both") +
  labs(y = "$ millions",
       title = "Seasonal plot: Antidiabets drug sales") +
  expand_limits(x = ymd(c("1972-12-28", "1973-12-04")))

Multiple seasonal periods

Where the data has more than one seasonal pattern, the period argument can be used to select which seasonal plot is required. The vic_elec data contains half-hourly electricity demand for the state of Victoria, Australia. We can plot the daily pattern, weekly pattern or yearly pattern by specifying the period argument.

vic_elec %>% gg_season(Demand, period = "day") +
  theme(legend.position = "none") +
  labs(y = "MW", title = "Electricity demand: Victoria")

vic_elec %>% gg_season(Demand, period = "week") +
  theme(legend.position = "none") +
  labs(y = "MW", title = "Electricity demand: Victoria by day of the week")

vic_elec %>% gg_season(Demand, period = "year") +
  labs(y="Megawatts", title = "Electricity demand: Victoria") 

2.5 Seasonal subseries plots

An alternative plot that emphasises the seasonal patterns is where the data for each season are collected together in separate mini time plots.

a10 %>% 
  gg_subseries(TotalCost) +
  labs(
    y = "$ millions",
    title = "Australian antidiabetic drug sales"
  )

Example: Australian holiday tourism

holidays <- tourism %>% 
  filter(Purpose == "Holiday") %>% 
  group_by(State) %>% 
  summarise(Trips = sum(Trips))

holidays
## # A tsibble: 640 x 3 [1Q]
## # Key:       State [8]
##    State Quarter Trips
##    <chr>   <qtr> <dbl>
##  1 ACT   1998 Q1  196.
##  2 ACT   1998 Q2  127.
##  3 ACT   1998 Q3  111.
##  4 ACT   1998 Q4  170.
##  5 ACT   1999 Q1  108.
##  6 ACT   1999 Q2  125.
##  7 ACT   1999 Q3  178.
##  8 ACT   1999 Q4  218.
##  9 ACT   2000 Q1  158.
## 10 ACT   2000 Q2  155.
## # … with 630 more rows
autoplot(holidays, Trips) +
  labs(y = "Overnight trips '000",
       title = "Australian domestic holidays") +
  facet_grid(~State)

To see the timing of the seasonal peaks in each state, we can use a season plot.

gg_season(holidays, Trips) +
  labs(y = "Overnight trips '000",
       title = "Australian domestic holidays")

holidays %>%
  gg_subseries(Trips) +
  labs(y = "Overnight trips '000",
       titla = "Australian domestic holidays")

2.6 Scatterplots

The graphs discussed so far are useful for visualising individual time series. It is also useful to explore relationships between time series.

vic_elec %>% 
  filter(year(Time) == 2014) %>% 
  autoplot(Demand)+
  labs(y = "Gigawatts",
       title = "Half-hourly electricity demand: Victoria")

vic_elec %>% 
  filter(year(Time) == 2014) %>% 
  autoplot(Temperature) +
  labs(
    y = "Degrees Celsius",
    title = "Half-hourly temperatures: Melbourne, Australia"
  )

vic_elec %>% 
  filter(year(Time) == 2014) %>% 
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point() +
  labs(x = "Temperature (degrees Celsius)",
    title = "Electricity demand (gigawatts)")

Correlation

\[r = \frac{\sum_{i=0}^n\left(x_t - \bar{x} \right)\left(y_t - \bar{y} \right)^2}{\sqrt{\left(x_t - \bar{x} \right)^2}\sqrt{\left(y_t - \bar{y} \right)^2}}\]

Scatterplot matrices

When there are several potential predictor variables, it is useful to plot each variable against each other variable.

visitors <- tourism %>% 
  group_by(State) %>% 
  summarise(Trips = sum(Trips))
visitors %>% 
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(vars(State), scales = "free_y") +
  labs(title = "Australian domestic tourism",
       y = "Overnight trips '000")

To see the relationships between these eight time series, we can plot each time series against the others. These plots can be arranged in a scatterplot matrix, as shown in Figure 2.18. (This plot requires the GGally package to be installed.)

visitors %>% 
  pivot_wider(values_from = Trips, names_from = State) %>% 
  GGally::ggpairs(columns = 2:9)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

For each panel, the variable on the vertical axis is given by the variable name in that row, and the variable on the horizontal axis is given by the variable name in that column. There are many options available to produce different plots within each panel. In the default version, the correlations are shown in the upper right half of the plot, while the scatterplots are shown in the lower half. On the diagonal are shown density plots.

The value of the scatterplot matrix is that it enables a quick view of the relationships between all pairs of variables. In this example, mostly positive relationships are revealed, with the strongest relationships being between the neighbouring states located in the south and south east coast of Australia, namely, New South Wales, Victoria and South Australia. Some negative relationships are also revealed between the Northern Territory and other regions. The Northern Territory is located in the north of Australia famous for its outback desert landscapes visited mostly in winter. Hence, the peak visitation in the Northern Territory is in the July (winter) quarter in contrast to January (summer) quarter for the rest of the regions.

2.7 Lab plots

Figure 2.19 displays scatterplots of quarterly Australian beer production (introduced in Figure 1.1), where the horizontal axis shows lagged values of the time series. Each graph shows \[y_t \textrm{ is plotted against } y_{t-k} \textrm{ for different values of k}\]

recent_production <- aus_production %>% 
  filter(year(Quarter) >= 2000)
recent_production %>% 
  gg_lag(Beer, geom = "point") +
  labs(x = "lab(Beerk, k")

Here the colours indicate the quarter of the variable on the vertical axis. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data. The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2)

2.8 Autocorrelation

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

\[r_k = \frac{\sum_{t = k+1}^T\left(y_t - \bar{y} \right)\left(y_{t-k} - \bar{y} \right)}{\sum_{t = 1}^T{\left(y_t - \bar{y} \right)^2}}\]

where T is the length of the time series. The autocorrelation coefficients make up the autocorrelation function or ACF.

The autocorrelation coefficients for the beer production data can be computed using the ACF() function.

recent_production %>% ACF(Beer, lab_max = 9)
## Warning: The `...` argument of `PACF()` is deprecated as of feasts 0.2.2.
## ACF variables should be passed to the `y` argument. If multiple variables are to be used, specify them using `vars(...)`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: ACF currently only supports one column, `Beer` will be used.
## # A tsibble: 16 x 2 [1Q]
##      lag      acf
##    <lag>    <dbl>
##  1    1Q -0.0530 
##  2    2Q -0.758  
##  3    3Q -0.0262 
##  4    4Q  0.802  
##  5    5Q -0.0775 
##  6    6Q -0.657  
##  7    7Q  0.00119
##  8    8Q  0.707  
##  9    9Q -0.0888 
## 10   10Q -0.588  
## 11   11Q  0.0241 
## 12   12Q  0.632  
## 13   13Q -0.0660 
## 14   14Q -0.505  
## 15   15Q -0.00891
## 16   16Q  0.498

The values in the acf column are \[r_1...r_9\], corresponding to the nine scatterplots in Figure 2.19. We usually plot the ACF to see how the correlations change with the lag
k. The plot is sometimes known as a correlogram.

recent_production %>% 
  ACF(Beer) %>% 
  autoplot() + labs(title = "Australian beer production")

In this graph: • \[r_4\] is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart. • \[r_2\] is more negative than for the other lags because troughs tend to be two quarters behind peaks. The dashed blue lines indicate whether the correlations are significantly different from zero (as explained in Section 2.9).

Trend and seasonality in Autocorrelation plots

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.

When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.

a10 %>% 
  ACF(TotalCost, lag_max = 48) %>% 
  autoplot() +
  labs(title = "Australian antidiabetic drug sales")

2.9 White noise

Time series that show no autocorrelation are called white noise. Figure 2.22 gives an example of a white noise series.

set.seed(30)
y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)
y %>% autoplot(wn) + labs(title = "White noise", y = "")

y %>% 
  ACF(wn) %>% 
  autoplot() + labs(title = "White noise")

Completion of text for chapter 2.